The following analysis is based on the dataset described in P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
The variables in the Wine Quality dataset are as follows.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
The types of the variables are as follows.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The variable X is simply an index for each observation in the dataset. All the other variables are numeric, and the quality variable is stored as an integer.
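The inspection steps can be reproduced on a small scale. The sketch below rebuilds only the first ten observations printed in the `str()` output, so it runs without the original data file. The names `wine_head` and `wine_vars` are illustrative and do not come from the report, and dropping X before the analysis is an assumption suggested by the later summaries, which do not list it.

```r
# First ten observations, copied from the str() output above.
# density is left out because that output shows only five of its values.
wine_head <- data.frame(
  X                    = 1:10,
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity     = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid          = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  residual.sugar       = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  chlorides            = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069,
                           0.065, 0.073, 0.071),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  pH                   = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  sulphates            = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol              = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality              = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

names(wine_head)   # column names
str(wine_head)     # type and first values of each column

# X only numbers the rows, so it is dropped before summarizing.
wine_vars <- subset(wine_head, select = -X)
summary(wine_vars)
```

Because only ten rows are used, the printed summaries will not match the figures reported for the full 1599 observations.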
A closer look at the variability of the numerical data is as follows.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
A boxplot can be used to visualize the variability of each variable, as follows.
## Using as id variables
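The message above appears to come from a `melt()` call that reshapes the data frame into long form before plotting, a step the conclusion of this report also mentions. As a rough stand-in that needs no extra packages, base R's `stack()` performs a similar wide-to-long conversion. The names `wide` and `long` are illustrative, and only three columns of the first ten observations are used.

```r
# Three columns, first ten observations each, copied from the str() output.
wide <- data.frame(
  pH        = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# stack() returns two columns:
#   values: the measurement
#   ind:    a factor naming the column the measurement came from
long <- stack(wide)
head(long)
table(long$ind)   # ten rows per variable

# With the data in long form, one call draws a box per variable:
# boxplot(values ~ ind, data = long)
```

The long layout is what makes it possible to draw one panel or one box per variable from a single plotting call.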
The lower and upper whiskers extend to the lowest and highest points that lie within 1.5 times the interquartile range of the box edges; points beyond that range are drawn as outliers. Moreover, histograms of every variable help in understanding the distribution of each variable, as follows.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
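The whisker rule described for the boxplots can be checked by hand. The sketch below applies it to the first ten residual sugar values from the `str()` output and compares the result with base R's `boxplot.stats()`. This is a base R illustration and not the plotting code used for the report; the variable names are illustrative.

```r
# First ten residual.sugar values, copied from the str() output.
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

hinges <- fivenum(sugar)[c(2, 4)]          # lower and upper hinge (box edges)
spread <- diff(hinges)                     # interquartile range between hinges
fences <- hinges + c(-1.5, 1.5) * spread   # limits for the whiskers

inside   <- sugar[sugar >= fences[1] & sugar <= fences[2]]
whiskers <- range(inside)                  # most extreme points inside the fences
outliers <- sugar[sugar < fences[1] | sugar > fences[2]]

# boxplot.stats() applies the same rule (coef = 1.5 by default).
bs <- boxplot.stats(sugar)
bs$stats[c(1, 5)]   # whisker ends
bs$out              # points drawn individually
```

For these ten values the whiskers end at 1.2 and 2.6, and the value 6.1 falls outside the upper fence, so it would be drawn as a separate point.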
Most of the variables look approximately normally distributed. Three of the variables appear to have lognormal distributions: alcohol, sulphates and total sulfur dioxide. Furthermore, the distributions of residual sugar and chlorides are difficult to see because of outliers. In the following, the values above the 95th percentile of residual sugar and chlorides are excluded, and the histograms for both are replotted.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 79 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 80 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
After excluding the outliers for both residual sugar and chlorides, the distributions look approximately normal.
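The exclusion of values above the 95th percentile can be sketched as a subsetting step. The example uses the first ten chlorides values from the `str()` output. The warnings printed above suggest the report may have applied the cutoff as an axis limit in the plot rather than by subsetting, so this should be read as one possible equivalent, with illustrative variable names.

```r
# First ten chlorides values, copied from the str() output.
chlorides <- c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069,
               0.065, 0.073, 0.071)

cutoff <- quantile(chlorides, probs = 0.95)   # 95th percentile
kept   <- chlorides[chlorides <= cutoff]      # values at or below the cutoff

length(chlorides) - length(kept)              # number of values dropped
summary(kept)
```

On the full dataset the same step would drop roughly 5% of the 1599 observations, which is consistent with the 79 and 80 rows reported as removed.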
The statistical summary of residual sugar is as follows.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The statistical summary of chlorides is as follows.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The main interest is the quality rating and the variables that influence it. The quality ratings are visualized as a histogram, and their summary is as follows.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Most wines have a quality rating of 5 or 6.
Alcohol content is something people take into account when buying wine, so its summary is as follows:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The figure shows that alcohol content has a lognormal distribution with a high peak at the lower end of the alcohol scale.
The relationship between every pair of variables, together with their Pearson product-moment correlation, can be visualized quickly in a plot matrix. The names on the x and y axes of the plot matrix are as follows.
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
The four highest positive correlation coefficients with quality are as follows.
* alcohol:quality = 0.48
* sulphates:quality = 0.25
* citric.acid:quality = 0.23
* fixed.acidity:quality = 0.12
The strongest negative correlation coefficients with quality are as follows.
* volatile.acidity:quality = -0.39
* total.sulfur.dioxide:quality = -0.19
* density:quality = -0.17
* chlorides:quality = -0.13
The variable pairs with the strongest correlations (positive or negative) are as follows.
* fixed.acidity:citric.acid = 0.67
* fixed.acidity:density = 0.67
* free.sulfur.dioxide:total.sulfur.dioxide = 0.67
* alcohol:quality = 0.48
* density:alcohol = -0.50
* citric.acid:pH = -0.54
* volatile.acidity:citric.acid = -0.55
* fixed.acidity:pH = -0.68
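Coefficients like those listed above can be obtained with `cor()`. The sketch below uses four columns of the first ten observations from the `str()` output, so its numbers will differ from the values reported for the full dataset. The names `wine`, `r` and `with_quality` are illustrative.

```r
# Four columns, first ten observations each, copied from the str() output.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson product-moment correlation between every pair of columns.
r <- cor(wine, method = "pearson")
round(r, 2)

# Correlation of each variable with quality, sorted from most negative
# to most positive; quality's correlation with itself is left out.
with_quality <- sort(r[rownames(r) != "quality", "quality"])
with_quality
```

Sorting the quality column of the matrix is a quick way to pick out the strongest positive and negative relationships without reading the whole plot matrix.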
A closer look at some of these relationships follows, starting with density and alcohol.
As shown, density increases as alcohol content decreases.
Next, fixed acidity and pH.
As shown, fixed acidity increases as pH decreases.
Next, fixed acidity and density.
As shown, fixed acidity increases as density increases.
A closer look at alcohol content by wine quality, using a density plot, follows.
Wines with a high alcohol content tend to have a high quality rating. Also, the wines with a quality rating of 5 stand out in the plot; in the summary that follows they have the lowest median alcohol content of any quality level (9.7).
The statistical summary of alcohol content for each quality level is as follows.
## factor(dataset$quality): 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## factor(dataset$quality): 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## factor(dataset$quality): 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## factor(dataset$quality): 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## factor(dataset$quality): 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## factor(dataset$quality): 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
Sulphate content also appears to matter for wine quality, particularly at the high quality levels of 7 and 8.
The statistical summary of sulphates for each quality level is as follows.
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5125 0.5450 0.5700 0.6150 0.8600
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5600 0.5964 0.6000 2.0000
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.370 0.530 0.580 0.621 0.660 1.980
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5800 0.6400 0.6753 0.7500 1.9500
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7413 0.8300 1.3600
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6300 0.6900 0.7400 0.7678 0.8200 1.1000
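The two per-quality summaries above have the layouts that `by()` and `tapply()` print, so grouped summaries of this kind can be sketched with either function. The data frame name `dataset` is taken from the output headers; its contents here are only the first ten observations from the `str()` output, and `med_alcohol` is an illustrative name.

```r
# First ten observations of three columns, copied from the str() output.
dataset <- data.frame(
  quality   = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)
)

# by() prints one block per group, separated by dashed lines.
by(dataset$alcohol, factor(dataset$quality), summary)

# tapply() returns a list-like array with one summary per group.
tapply(dataset$sulphates, dataset$quality, summary)

# A single statistic per group gives a compact named vector.
med_alcohol <- tapply(dataset$alcohol, dataset$quality, median)
med_alcohol
```

With only ten rows, the groups are quality levels 5, 6 and 7; the full dataset has the six levels from 3 to 8 shown in the output above.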
The relationship among sulphates, volatile.acidity, alcohol and quality is visualized as follows.
High-quality wines are mostly concentrated in the top left part of the plot. The large dots, which represent a high alcohol content, are concentrated there as well.
The quality is summarized with a contour plot of alcohol content against sulphate content as follows.
High-quality wines are mostly located in the top right part of the plot, where the contour lines are darker, whereas low-quality wines are mostly located in the bottom right part.
The quality is also visualized with density plots along the x and y axes and with color, as follows.
High-quality wines are again mostly located in the top right part of the plot.
The coming parts give a summary of the main findings with refined plots.
The highest correlation coefficient with quality was that of alcohol. A closer look at alcohol content by wine quality, using a density plot, is as follows.
As shown, the density curves for the high-quality wines, drawn in red, are shifted to the right, which means that these wines have a comparatively high alcohol content relative to the low-quality wines. Also, the wines with a quality rating of 5 stand out in the plot.
The statistical summary of alcohol content for each quality level is as follows.
## dataset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## dataset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## dataset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## dataset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## dataset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## dataset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
Sulphates are also positively correlated with wine quality (r = 0.25), whereas volatile acidity has a negative correlation (r = -0.39). The following scatter plot shows the relationship between sulphates and volatile acidity, together with alcohol content and wine quality.
High-quality wines are mostly concentrated in the top right part of the plot, and the large dots are concentrated in the same area.
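The values quoted for sulphates and volatile acidity are the Pearson correlation coefficients that also appear in the correlation lists earlier. For a straight-line fit with one predictor, the coefficient of determination is the square of that correlation, so it can never be negative. The short check below squares the reported coefficients; the vector names are illustrative.

```r
# Correlation coefficients with quality, as reported in this document.
r <- c(alcohol = 0.48, sulphates = 0.25, volatile.acidity = -0.39)

# Squaring gives the share of variance in quality that a one-predictor
# linear fit would account for.
r_squared <- r^2
round(r_squared, 2)
```

Even the strongest single predictor, alcohol, would account for less than a quarter of the variance in quality on its own (0.48 squared is about 0.23).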
The summary of alcohol content by quality rating is as follows.
## dataset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## dataset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## dataset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## dataset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## dataset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## dataset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
The summary of sulphate content by quality rating is as follows.
## dataset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5125 0.5450 0.5700 0.6150 0.8600
## --------------------------------------------------------
## dataset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5600 0.5964 0.6000 2.0000
## --------------------------------------------------------
## dataset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.370 0.530 0.580 0.621 0.660 1.980
## --------------------------------------------------------
## dataset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5800 0.6400 0.6753 0.7500 1.9500
## --------------------------------------------------------
## dataset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7413 0.8300 1.3600
## --------------------------------------------------------
## dataset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6300 0.6900 0.7400 0.7678 0.8200 1.1000
The summary of volatile acidity by quality rating is as follows.
## dataset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## dataset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## dataset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## dataset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## dataset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## dataset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
The following plot represents the relationship between alcohol content and sulphates by combining a scatter plot with density plots.
The dataset explored in this exercise contains 1599 observations of different wines with 12 variables (plus the index X). Firstly, the project started by understanding and analyzing the variables. Secondly, questions were explored by making observations on the plots. Finally, the relationship between wine quality and the other variables was analyzed.
The analysis in this project has considered the relationship between the attributes of the wines and their quality. In addition, melting the data frame and using facet grids have been very helpful in visualizing the distribution of every parameter with boxplots and histograms. The majority of the parameters are approximately normally distributed, whereas citric acid, free sulfur dioxide, total sulfur dioxide and alcohol tend toward a lognormal distribution.
Future work could develop a model to analyze and predict wine quality from the same dataset.